This RMarkdown file contains the report of a data analysis project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and time series model building. The final report was completed on Fri Aug 15 12:37:57 2025.
Data Description:
This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
Objectives
The objective is to build a short-term forecasting model for daily rentals using an ARIMA model. Before fitting ARIMA, the data is evaluated for stationarity - a prerequisite for ARIMA - and for seasonal patterns.
Step 1 - Import and clean the data
# Load the packages used throughout the report
library(dplyr)      # data wrangling
library(tidyr)      # drop_na()
library(lubridate)  # as_date()
library(ggplot2)    # plotting
library(ggpubr)     # theme_pubr()
library(timetk)     # plot_time_series(), tk_stl_diagnostics()
library(tseries)    # adf.test()
library(modeltime)  # arima_reg(), modeltime_table()
library(parsnip)    # set_engine(), fit()
# Import the data
data <- read.csv("day.csv") # data is a data frame
# check structure of dataframe
str(data) # 731 rows and 16 variables
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
# convert dteday from character to date format
data$dteday <- as_date(data$dteday)
# Replace numeric weekdays with labeled factor
data <- data %>%
mutate(weekday = factor(weekday,
levels = 0:6,
labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")))
Step 2 - Explore the data
glimpse(data)
## Rows: 731
## Columns: 16
## $ instant <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 2011-01-05…
## $ season <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ weekday <fct> Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday, Fri…
## $ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ temp <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
## $ atemp <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
## $ hum <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
## $ windspeed <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
## $ casual <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
## $ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
## $ cnt <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…
# summary of the data to check for any weird values
summary(data)
## instant dteday season yr
## Min. : 1.0 Min. :2011-01-01 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 1st Qu.:2011-07-02 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Median :2012-01-01 Median :3.000 Median :1.0000
## Mean :366.0 Mean :2012-01-01 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:2012-07-01 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :2012-12-31 Max. :4.000 Max. :1.0000
##
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Sunday :105 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 Monday :105 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Tuesday :104 Median :1.000
## Mean : 6.52 Mean :0.02873 Wednesday:104 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 Thursday :104 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Friday :104 Max. :1.000
## Saturday :105
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
##
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
##
## We can see that hum has a minimum of 0, i.e. 0% relative humidity. That is physically implausible outdoors, and certainly not in the humid subtropical climate of Washington, DC, so it is likely a data artefact.
## summary() treats every variable as continuous, so binary columns such as workingday (0 or 1) are given means and medians between 0 and 1 that have no real meaning.
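One way to make summary() meaningful is to recast the integer-coded columns as labelled factors. A minimal sketch, assuming the UCI coding of season (1 = Spring through 4 = Winter); a small toy data frame stands in for the full dataset here:

```r
# Toy stand-in for the imported data frame (same coded columns as day.csv)
toy <- data.frame(
  season     = c(1, 1, 2, 3, 4, 3),
  workingday = c(0, 0, 1, 1, 1, 0)
)

# Recode the integer codes as labelled factors
toy$season     <- factor(toy$season, levels = 1:4,
                         labels = c("Spring", "Summer", "Fall", "Winter"))
toy$workingday <- factor(toy$workingday, levels = 0:1, labels = c("No", "Yes"))

summary(toy)  # now reports counts per level, not a spurious mean
```

The same mutate-style recoding used for weekday above could be extended to these columns before calling summary().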
# generate deeper EDA report
DataExplorer::create_report(data)
## Output created: report.html
# the report shows no missing values
# mnth, season, weathersit and weekday are discrete variables - shown in the univariate distribution plots
## The QQ plot shows how normally the data are distributed. It is intended for continuous variables; discrete or categorical variables such as weekday or season are not appropriate for this plot type.
# What effect does temperature have on the average number of bikes being hired?
data %>%
select(dteday, season, temp, weekday, cnt) %>%
group_by(season, weekday) %>%
summarise(avg_cnt = mean(cnt), mean_t_c = mean(temp * 41)) # adjusting the mean temperature to °C
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
## # A tibble: 28 × 4
## # Groups: season [4]
## season weekday avg_cnt mean_t_c
## <int> <fct> <dbl> <dbl>
## 1 1 Sunday 2229. 11.5
## 2 1 Monday 2453. 11.3
## 3 1 Tuesday 2793. 12.2
## 4 1 Wednesday 2611. 12.4
## 5 1 Thursday 2894. 13.1
## 6 1 Friday 2856. 12.9
## 7 1 Saturday 2432. 12.1
## 8 2 Sunday 4987. 22.4
## 9 2 Monday 4565 22.9
## 10 2 Tuesday 4825. 22.9
## # ℹ 18 more rows
# --> The hotter the temperature, the more bikes are being rented out
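The claim can also be checked numerically with a Pearson correlation. A sketch on illustrative values (on the loaded data frame, the equivalent check would be `cor(data$temp, data$cnt)`):

```r
# Illustrative normalised temperatures and daily rental counts
temp <- c(0.20, 0.23, 0.34, 0.36, 0.50, 0.60, 0.70)
cnt  <- c(1349, 1600, 985, 801, 3500, 4500, 5200)

# A strongly positive correlation supports "hotter days, more rentals"
cor(temp, cnt)
```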
ggplot(data, aes(x = dteday, y = temp * 41)) + geom_point() + labs(x = "Date", y = "Temperature (°C)") # temp is normalised; multiply by 41 to recover °C
# the temperature fluctuates from approximately 6°C to 33°C over 2011. The lowest temperatures appear higher at the end of 2011 than at the start, but no significance test has been run.
boxplot_temp_by_year <- data %>%
mutate(
temp_c = temp * 41
) %>%
ggplot(aes(x = factor(yr), y = temp_c)) +
geom_boxplot() +
scale_x_discrete(labels = c("0" = "2011", "1" = "2012")) +
labs(
x = "Year",
y = "Temperature (°C)"
) + theme_pubr()
boxplot_temp_by_year
# which seasons and months are peak vs off-peak?
seasonal_peaks <- data %>%
group_by(season) %>%
summarise(total_rentals = sum(cnt), .groups = "drop") %>%
arrange(desc(total_rentals))
seasonal_peaks
## # A tibble: 4 × 2
## season total_rentals
## <int> <int>
## 1 3 1061129
## 2 2 918589
## 3 4 841613
## 4 1 471348
monthly_peaks <- data %>%
group_by(mnth) %>%
summarise(total_rentals = sum(cnt), .groups = "drop") %>%
arrange(desc(total_rentals))
monthly_peaks
## # A tibble: 12 × 2
## mnth total_rentals
## <int> <int>
## 1 8 351194
## 2 6 346342
## 3 9 345991
## 4 7 344948
## 5 5 331686
## 6 10 322352
## 7 4 269094
## 8 11 254831
## 9 3 228920
## 10 12 211036
## 11 2 151352
## 12 1 134933
In 2011, the temperature ranges from 2.4°C to 35.3°C with a median of 20.4°C. The temperatures are not significantly different in 2012 but appear to be slightly higher, especially when comparing the medians. The range of values in 2012 is also narrower.
The month-based peak season is in August, June and September. There is a clear drop in bike rentals in April, November, March and December. In terms of yearly seasons, Fall is the peak season, followed by Summer, Winter and then Spring.
In this task, four time series are plotted:
## Read about the timetk package
# ?timetk
# how many bikes are rented over time
riders_ts <- data %>%
plot_time_series(dteday, cnt, .y_lab = "Number of Bikes")
riders_ts
# how many casual riders are there over time
casual_rider_ts <- data %>%
plot_time_series(dteday, casual, .y_lab = "Number of Casual Riders")
casual_rider_ts
# how many casual riders are there over time by day of the week
casual_riders_by_weekday <- data %>%
group_by(weekday) %>%
plot_time_series(dteday, casual, .facet_ncol=2, .y_lab = "Number of Casual Riders", .title = "Number of Casual Riders vs Time by Week Day")
casual_riders_by_weekday
registered_riders_by_weekday <- data %>%
group_by(weekday) %>%
plot_time_series(dteday, registered, .facet_ncol=2, .y_lab = "Number of Registered Riders", .title = "Number of Registered Riders vs Time by Week Day")
registered_riders_by_weekday
Interpretation
There are consistently fewer casual riders taking trips on working days than on Saturdays and Sundays, suggesting casual users ride for leisure rather than commuting. There is a mid-year peak and a drop at the end of the year, consistent with seasonal leisure behaviour.
There are consistently more registered riders than casual riders. The smoothed trend line fluctuates less across weekdays than for casual riders, especially Monday to Friday. This indicates that registered users are primarily commuters, with regular weekday usage and reduced weekend activity.
Task Three - Smooth time series data
plot_time_series(
.data = data,
.date_var = dteday,
.value = cnt,
.smooth = TRUE, # enables smoothing
.interactive = FALSE, # set TRUE for plotly
.title = "Smoothed Daily Bike Rentals Over Time",
.y_lab = "Number of Riders",
.x_lab = "Date"
)
Interpretation
The black line is for actual daily rental counts and it is spiky due to the daily variability. The blue, smoothed Loess line shows the underlying long-term trend. There is rising adoption in Jan-June 2011 which may reflect a recent launch of the bike hire system or be due to weather changes. There is a plateau between July 2011 and December 2011 where growth stabilises. The market may be saturated by early adopters, or the uptake is affected by the end of summer or other external influences like limits to infrastructure. There is another growth phase between Jan and Sept 2012 which may be due to increased membership, more bike infrastructure e.g. stations, better weather and changes in commuting habits. Then there is a decline in September-December 2012 which may be due to colder weather, public holidays or other external disruptions. The plot shows macro trends (growth, plateau, decline) but not micro patterns (weekly seasonality and effect of public holidays). Below, the effect of weather is incorporated into the modelling.
The goal of this task is to break the series into trend, seasonality, and noise components, then assess whether the series is stationary. Stationarity means the mean, variance, and autocorrelation structure do not change over time — a key assumption for ARIMA modelling. Non-stationary data can lead to misleading correlations and poor forecasts.
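The idea can be illustrated in miniature with simulated data (a base R sketch, not the bikeshare series): a trending series has a drifting mean, and first differencing removes the drift.

```r
set.seed(42)
# Non-stationary series: random walk plus a deterministic upward trend
trend_series <- cumsum(rnorm(200)) + 0.5 * (1:200)
diffed       <- diff(trend_series)  # first difference

# The raw series' mean drifts between halves; the differenced series is stable
mean(trend_series[1:100]); mean(trend_series[101:200])
mean(diffed)  # close to the trend slope (0.5), constant over time
```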
Step 1 - STL Decomposition We use seasonal-trend decomposition via Loess (STL) to identify the trend (long-term direction of data), seasonality (repeating patterns at fixed intervals e.g. weekly) and remainder (residual noise after removing trend and seasonality).
p_stl <- plot_stl_diagnostics(
.data = data,
.date_var = dteday,
.value = cnt,
.frequency = "7 days",
.trend = "3 months",
.title = "STL Decomposition of Daily Bike Rentals",
.y_lab = "Count of Bike Rentals"
)
## frequency = 7 observations per 7 days
## trend = 90 observations per 3 months
decomp_tbl <- tk_stl_diagnostics(
.data = data,
.date_var = dteday,
.value = cnt,
.frequency = "7 days",
.trend = "3 months"
)
## frequency = 7 observations per 7 days
## trend = 90 observations per 3 months
Step 2 - ACF/PACF of Remainder
p_acf_remainder <- plot_acf_diagnostics(
.data = decomp_tbl,
.date_var = dteday,
.value = remainder,
.lags = 60,
.title = "ACF & PACF of STL Remainder (Deseasonalised Residuals)"
)
p_acf_remainder
Interpretation - ACF shows a large spike at lag 1 (~0.9) with slow exponential decay over lags 2–10, suggesting strong autocorrelation and non-stationarity.
PACF shows a large spike at lag 1, with subsequent lags near zero — consistent with an AR(1) process.
No strong weekly seasonal spike is visible.
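The visual read of the ACF can be backed by a Ljung-Box test, which tests the null hypothesis of no autocorrelation up to a chosen lag. A base R sketch on a simulated AR(1) series standing in for the autocorrelated counts:

```r
set.seed(3)
# Simulate a strongly autocorrelated AR(1) series (phi = 0.9)
ar1 <- arima.sim(model = list(ar = 0.9), n = 500)

# Ljung-Box: H0 = no autocorrelation up to lag 10
Box.test(ar1, lag = 10, type = "Ljung-Box")  # tiny p-value: autocorrelation present
```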
Step 3 - Stationarity Test (ADF) on Raw Series
Using Augmented Dickey-Fuller (ADF) test to statistically confirm stationarity.
# Test for stationarity on the dataset
adf.test(data$cnt)
##
## Augmented Dickey-Fuller Test
##
## data: data$cnt
## Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
## alternative hypothesis: stationary
# p-value (0.7327) > 0.05 --> fail to reject the null hypothesis --> series is likely non-stationary
Conclusion for Task 4: the series shows strong autocorrelation and fails the ADF test for stationarity; differencing (first and seasonal) is needed before fitting ARIMA.
Given the series is non-stationary, we apply differencing to achieve stationarity before fitting an ARIMA model. ARIMA has both seasonal and non-seasonal components:
General form: ARIMA(p, d, q)(P, D, Q)[s]
p - Non-seasonal autoregressive (AR) order. p = 1 uses 1 lag of past values (AR(1)), i.e. "yesterday's value".
d - Non-seasonal differencing order, to remove trend. d = 1 applies a first difference to make the series stationary.
q - Non-seasonal moving average (MA) order. q = 1 uses 1 lag of the error terms (MA(1)).
These parameters capture the short-term autocorrelation and trend.
P - Seasonal AR order. P = 1 uses 1 seasonal lag of past values, i.e. the value 7 days ago.
D - Seasonal differencing order. D = 0 means no seasonal differencing - the data was already deseasonalised or did not require it.
Q - Seasonal MA order. Q = 2 uses 2 seasonal lags of the error terms (i.e. errors from 7 and 14 days ago).
[s] - Season length (the period of the seasonality). s = 7 gives weekly seasonality, capturing the repeating weekly patterns in the data.
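For reference, a spec of this general form can also be fitted directly with base R's stats::arima(), rather than letting auto_arima choose. A sketch on a simulated integrated ARMA series (illustrative data, not the bikeshare counts):

```r
set.seed(1)
# Integrated series whose first difference is ARMA(1,1)
sim <- cumsum(as.numeric(arima.sim(model = list(ar = 0.4, ma = -0.3), n = 364)))

# ARIMA(1,1,1)(1,0,2)[7]: the same form auto_arima selects later in this report
fit <- arima(sim, order = c(1, 1, 1),
             seasonal = list(order = c(1, 0, 2), period = 7))
coef(fit)  # ar1, ma1, sar1, sma1, sma2
```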
Step 1 - Differencing and Stationarity Check
# run test again after differencing
cnt_diff1 <- diff(data$cnt, lag = 1)
adf.test(na.omit(cnt_diff1))
## Warning in adf.test(na.omit(cnt_diff1)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(cnt_diff1)
## Dickey-Fuller = -13.798, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
# p-value < 0.05 --> reject the null hypothesis --> series is stationary --> ARIMA can be used
# run after seasonal and first differencing (weekly seasonality = 7 day lag)
cnt_diff1_7 <- diff(cnt_diff1, lag = 7)
adf.test(na.omit(cnt_diff1_7))
## Warning in adf.test(na.omit(cnt_diff1_7)): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: na.omit(cnt_diff1_7)
## Dickey-Fuller = -16.339, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
# p-value < 0.05 --> likely stationary after first + seasonal differencing
# plot raw and differenced
# Raw series
acf_raw <- plot_acf_diagnostics(
.data = data,
.date_var = dteday,
.value = cnt,
.lags = 60,
.title = "Raw Series: ACF & PACF"
)
acf_raw
# First + seasonal differencing
data_diff <- data %>%
mutate(
cnt_diff1 = c(NA, diff(cnt, lag = 1)),
cnt_diff1_7 = c(rep(NA, 7), diff(c(NA, diff(cnt, lag = 1)), lag = 7))
) %>%
drop_na(cnt_diff1_7)
head(data_diff)
## instant dteday season yr mnth holiday weekday workingday weathersit
## 1 9 2011-01-09 1 0 1 0 Sunday 0 1
## 2 10 2011-01-10 1 0 1 0 Monday 1 1
## 3 11 2011-01-11 1 0 1 0 Tuesday 1 2
## 4 12 2011-01-12 1 0 1 0 Wednesday 1 1
## 5 13 2011-01-13 1 0 1 0 Thursday 1 1
## 6 14 2011-01-14 1 0 1 0 Friday 1 1
## temp atemp hum windspeed casual registered cnt cnt_diff1
## 1 0.138333 0.116175 0.434167 0.361950 54 768 822 -137
## 2 0.150833 0.150888 0.482917 0.223267 41 1280 1321 499
## 3 0.169091 0.191464 0.686364 0.122132 43 1220 1263 -58
## 4 0.172727 0.160473 0.599545 0.304627 25 1137 1162 -101
## 5 0.165000 0.150883 0.470417 0.301000 38 1368 1406 244
## 6 0.160870 0.188413 0.537826 0.126548 54 1367 1421 15
## cnt_diff1_7
## 1 47
## 2 -49
## 3 -271
## 4 -139
## 5 238
## 6 111
acf_diff <- plot_acf_diagnostics(
.data = data_diff,
.date_var = dteday,
.value = cnt_diff1_7,
.lags = 60,
.title = "Differenced Series: ACF & PACF"
)
acf_diff
Interpretation
Before differencing: the ACF shows a large spike at lag 1 with slow decay (strong autocorrelation), and the PACF shows a large spike at lag 1 with other lags near zero - AR(1) behaviour, indicating non-stationarity.
After differencing: the ACF drops off immediately after lag 0 with no prolonged decay (reduced autocorrelation), and the PACF shows no significant spikes after lag 0 - stationarity is confirmed, so ARIMA can be used.
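This before/after contrast can be reproduced in miniature with base R, using a random walk as a stand-in for the trending counts:

```r
set.seed(5)
rw <- cumsum(rnorm(300))  # non-stationary random walk

# Lag-1 autocorrelation: near 1 before differencing, near 0 after
acf(rw,       plot = FALSE)$acf[2]
acf(diff(rw), plot = FALSE)$acf[2]
```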
Step 2 - Fit ARIMA model
# Modeltime-friendly version
model <- arima_reg() %>%
set_engine("auto_arima") %>%
fit(cnt ~ dteday, data = data)
## frequency = 7 observations per 1 week
model_tbl <- modeltime_table(model)
model_calibrated <- model_tbl %>%
modeltime_calibrate(new_data = data)
Step 3 - Forecast with Confidence Intervals
model_calibrated %>%
modeltime_forecast(h = "30 days", actual_data = data) %>%
plot_modeltime_forecast(
.conf_interval_show = TRUE, # confidence ribbons belong to the plotting step
.conf_interval_alpha = 0.2,
.title = "30-Day Forecast with 95% Confidence Intervals"
)
Interpretation:
The 95% interval shown is, strictly speaking, a prediction interval: the range within which future daily rental counts are expected to fall 95% of the time.
CI is narrow in the short term (first week), moderately wide in weeks 2–3, and widens substantially after that — reflecting increasing uncertainty over time.
Historical data shows large swings (e.g., April and November 2012), so some volatility in the forecast is expected.
CI is symmetrical, suggesting no systematic bias.
Step 4 - Comparison of ARIMA and ARIMAX Models
library(dplyr)
library(timetk)
library(modeltime)
library(parsnip)
# 0) Clean & features ----------------------------------------------------------
data <- data %>%
mutate(
dteday = as.Date(dteday),
temp_c = temp * 41
) %>%
arrange(dteday)
# Baseline ARIMA (no exogenous regressors; the formula supplies only the date index)
arima_model <- arima_reg() %>%
set_engine("auto_arima") %>%
fit(cnt ~ dteday, data = data)
## frequency = 7 observations per 1 week
# ARIMAX with xregs
arimax_model <- arima_reg() %>%
set_engine("auto_arima") %>%
fit(cnt ~ dteday + temp_c + hum + windspeed, data = data)
## frequency = 7 observations per 1 week
model_tbl <- modeltime_table(arima_model, arimax_model)
refit_tbl <- model_tbl %>%
modeltime_refit(data = data)
## frequency = 7 observations per 1 week
## frequency = 7 observations per 1 week
refit_tbl %>%
modeltime_forecast(
new_data = data,
actual_data = data
) %>%
plot_modeltime_forecast(
.title = "ARIMA vs ARIMAX In-Sample Fit",
.conf_interval_alpha = 0.2
)
## Warning: ✖ Expecting the following names to be in the data frame: .conf_hi, .conf_lo.
## ℹ Proceeding with '.conf_interval_show = FALSE' to visualize the forecast without confidence intervals.
## Alternatively, try using `modeltime_calibrate()` before forecasting to add confidence intervals.
# 5) Accuracy on the calibration window ---------------------------------------
refit_tbl %>%
modeltime_calibrate(new_data = data) %>%
modeltime_accuracy()
## # A tibble: 2 × 9
## .model_id .model_desc .type mae mape mase smape rmse rsq
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 ARIMA(1,1,1)(1,0,2)[7] Fitt… 635. 57.9 0.870 17.2 915. 0.778
## 2 2 REGRESSION WITH ARIMA(2,1… Fitt… 542. 46.7 0.743 15.3 770. 0.842
# Inspect selected specs if curious
arima_model$fit
## Series: outcome
## ARIMA(1,1,1)(1,0,2)[7]
##
## Coefficients:
## ar1 ma1 sar1 sma1 sma2
## 0.3612 -0.9005 0.9106 -0.9035 0.0545
## s.e. 0.0427 0.0205 0.0760 0.0865 0.0417
##
## sigma^2 = 844004: log likelihood = -6014.88
## AIC=12041.75 AICc=12041.87 BIC=12069.31
arimax_model$fit
## Series: outcome
## Regression with ARIMA(2,1,1) errors
##
## Coefficients:
## ar1 ar2 ma1 temp_c hum windspeed
## 0.3570 0.0588 -0.8965 133.6327 -3340.5802 -3270.4342
## s.e. 0.0471 0.0431 0.0271 11.9336 237.0409 387.6349
##
## sigma^2 = 598871: log likelihood = -5888.77
## AIC=11791.54 AICc=11791.7 BIC=11823.7
Interpretation
ARIMA and ARIMAX models are compared in the figure. ARIMAX - the green line - tracks the shape of the actual time series (black) closely, which means the weather variables (temperature, humidity and wind speed) explain some of the variance. ARIMA underperforms because, as a univariate model on a series with strong seasonal influence, it mostly captures the long-term trend.
The residuals of ARIMAX are smaller than those of ARIMA, as is the AIC; it is the better model. The ARIMAX coefficients show that warmer weather means more rentals, while higher humidity or higher wind speed mean fewer rentals. The coefficients are large in absolute value with relatively small standard errors, suggesting they are statistically significant. The large magnitudes suggest a strong effect, but may also signal multicollinearity among the weather covariates.
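Note that the accuracy table above is computed on the training data itself; an out-of-sample holdout gives a fairer comparison between candidate models. A base R sketch of the idea, on a simulated weekly-seasonal series rather than the modeltime workflow:

```r
set.seed(7)
# Simulated daily demand: weekly cycle plus a slow random drift
y <- 4000 + 500 * sin(2 * pi * (1:200) / 7) + cumsum(rnorm(200, sd = 50))
train <- y[1:170]
test  <- y[171:200]

# Fit on the training window only (seasonal differencing handles the cycle)
fit  <- arima(train, order = c(1, 1, 1),
              seasonal = list(order = c(0, 1, 1), period = 7))
fc   <- predict(fit, n.ahead = 30)$pred
rmse <- sqrt(mean((test - fc)^2))
rmse  # out-of-sample RMSE, comparable across candidate models
```

With modeltime, the equivalent would be calibrating on a held-out slice of the data instead of the full data frame.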
The dataset contains 731 daily observations from the Capital Bikeshare system covering January 2011 to December 2012.
Features include date, weather, seasonal factors, and total bike rental counts (cnt), with casual and registered user breakdowns.
The smoothed time series plot of daily rentals (cnt) shows a clear upward trend over time.
STL decomposition revealed a weekly seasonal pattern, with higher rentals on specific days of the week.
However, the seasonal component is less dominant than the trend and random fluctuation components, especially in the first year.
ACF of the original series showed a strong lag-1 autocorrelation (~0.9) and a slow, exponential decay across further lags, indicating non-stationarity.
PACF showed a significant spike at lag 1 with all other lags close to zero — typical for an AR(1) process.
After differencing (first and seasonal), ACF/PACF diagnostics confirmed the series became more stationary and suitable for ARIMA modeling.
ARIMA(1,1,1)(1,0,2)[7]:
This model captures trend and weekly cycles.
AIC = 12,041.75; residual variance (σ²) = 844,004
In-sample fit shows reasonable tracking but fails to adapt to late-2012 rental drops.
ARIMAX (with temperature, humidity, windspeed):
Regression with ARIMA(2,1,1) errors and weather covariates.
AIC = 11,791.54; residual variance = 598,871
Weather variables significantly improve explanatory power:
temp_c: positive influence
humidity and wind speed: strong negative effects
ARIMAX substantially outperforms ARIMA, indicating that weather is a critical driver of daily bike rental counts.
A 30-day forecast was generated using the ARIMA model.
The forecast captured the ongoing upward trend and preserved weekly seasonality in projected values.
The forecast interval provides useful insights into expected rental demand and uncertainty bounds.
The time series of daily bike rentals is non-stationary and shows both trend and weekly seasonality.
ARIMAX clearly outperforms ARIMA for short-term forecasting due to the inclusion of weather effects.
This model could be used to support short-term operational planning (e.g. staffing, bike redistribution) within the bikeshare system.